I. Pending Claims

Claims 1, 2, 10, 11, 30, and 43-57 are currently pending. Claims 10, 30, 46, 48, 51, 52 and 
54-57 are being actively prosecuted. By this amendment, Claim 10 has been amended. 

Applicants expressly do not disclaim the subject matter of any invention disclosed herein which 
is not set forth in the instantly filed claims. Applicants reserve the right to prosecute the non-elected 
claims in subsequent divisional applications. 

II. Support for the Amendments

Claim 10 has been amended to incorporate the limitations of non-elected claim 1, from which it 
had depended. Support for this amendment may be found in the specification at page 3, lines 3-4, 
wherein it is set forth that the present invention features antibodies that bind specifically to hLC3. No 
new matter is added by this amendment. Applicants are amending the claim solely to obtain 
expeditious allowance of the instant application. 

III. Rejoinder 

The Examiner is reminded that method claims 43-46, 50 and 53 should be rejoined per the 
Commissioner's Notice in the Official Gazette of March 26, 1996, entitled "Guidance on Treatment of 
Product and Process Claims in light of In re Ochiai, In re Brouwer and 35 U.S.C. § 103(b)" which 
sets forth the rules, upon allowance of any of the product claims, for rejoinder of process claims 
covering the same scope of products. Applicants request that claims 43-46, 50 and 53 be rejoined and 
examined upon allowance of the product claims. 

IV. Claim Objections 

The Examiner has objected to claims 10, 30, 46, 48, 51-52, and 54-57 as each claim depends 
from non-elected inventions. Applicants have amended claim 10 to incorporate the limitations of
non-elected claim 1, from which it had depended. Accordingly, it is submitted that this amendment
obviates this objection as to claim 10.

With respect to claims 51-52 and 54-57, claims 51 and 54 depend on claims 50 and 53,
respectively. Claims 50 and 53 are directed to methods of making the antibodies of claims 51 and
54. Accordingly, upon allowance of claims 51 and 54, method claims 50 and 53 will be rejoined, 
examined and also allowed. Hence, claims 51 and 54 will not depend from non-elected inventions and 
this objection will be moot as to claims 51-52 and 54-57. 

For all the above reasons, reconsideration and withdrawal of this objection are respectfully 
requested. 



V. Claim Rejections - 35 U.S.C. § 112, first paragraph

Claims 10, 30, 46, 48, 51-52, and 54-57 have been rejected under 35 U.S.C. § 112, first
paragraph, as being based on a specification which allegedly fails to reasonably convey to one of skill in 
the art that the Applicants had possession of the claimed invention at the time the application was filed 
for the reasons found on pages 3 and 4 of the Office Action. This rejection is respectfully traversed. 

The requirements necessary to fulfill the written description requirement of 35 U.S.C. 112, first 
paragraph, are well established by case law. 

... the applicant must also convey with reasonable clarity to those skilled in the art 
that, as of the filing date sought, he or she was in possession of the invention. The 
invention is, for purposes of the "written description" inquiry, whatever is now 
claimed. Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991)
(emphasis added). 

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for Examination 
of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1," published January 5, 2001, which 
provide that: 

An applicant may also show that an invention is complete by disclosure of sufficiently 
detailed, relevant identifying characteristics which provide evidence that applicant was in 
possession of the claimed invention, i.e., complete or partial structure, other physical 
and/or chemical properties, functional characteristics when coupled with a known or
disclosed correlation between function and structure, or some combination of such
characteristics. What is conventional or well known to one of ordinary skill in the art 
need not be disclosed in detail. If a skilled artisan would have understood the inventor 
to be in possession of the claimed invention at the time of filing, even if every nuance of 
the claims is not explicitly described in the specification, then the adequate description 
requirement is met (footnotes omitted).

Thus, the written description standard is fulfilled by both what is specifically disclosed and what 
is conventional or well known to one skilled in the art. 

The Examiner's attention is drawn to the language of amended claim 10, which recites
antibodies that bind to polypeptides "comprising a naturally occurring amino acid sequence at least
90% identical to the amino acid sequence of SEQ ID NO:1" (i.e., "variants" of SEQ ID NO:1), and
which requires that such polypeptides bind to microtubules. First, note that the polypeptide sequence
of SEQ ID NO:1, to which the claimed antibodies bind, is specifically disclosed in the application (see,
for example, the sequence listing at page 49, which shows the amino acid sequence of SEQ ID NO:1, and
the description at page 11, lines 5-13). Variants, and in particular, naturally occurring variants, at least
90% identical to SEQ ID NO:1, are described at page 11, lines 14-17. Incyte clones in which the
nucleic acids encoding hLC3 were first identified, and libraries from which those clones were isolated, 
are described, for example, at page 10, line 29 through page 11, line 4 of the specification. Chemical 
and structural features of hLC3 are described, for example, on page 11, lines 6-11. 

Clearly, one of ordinary skill in the art would recognize polypeptide sequences that are naturally
occurring variants at least 90% identical to SEQ ID NO:1 and that bind to microtubules. Given any
particular naturally occurring polypeptide sequence, it would be routine for one of skill in the art to
recognize whether it was a variant of SEQ ID NO:1. It would also be routine to determine whether
such a variant had hLC3 activity, using the disclosed hLC3 activity assay (specification at page 46, 
Example IX). Accordingly, the specification provides an adequate written description of the structure 
of the genus of polypeptides to which the claimed antibodies bind. 
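
As an illustration of why such a comparison is routine, percent identity between a candidate sequence and a reference can be computed by counting matching positions. The following is a minimal sketch only, assuming the two sequences have already been aligned to equal length without gaps. The function names and the short sequences are hypothetical placeholders; they are not taken from the specification and are not the sequence of SEQ ID NO:1.

```python
def percent_identity(reference: str, candidate: str) -> float:
    """Percent of aligned positions at which the two sequences agree.

    Assumes both sequences are already aligned, gap-free, and of equal length.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must be pre-aligned to equal length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for r, c in zip(reference, candidate) if r == c)
    return 100.0 * matches / len(reference)


def is_variant(reference: str, candidate: str, threshold: float = 90.0) -> bool:
    """True if the candidate meets the identity threshold (90% by default)."""
    return percent_identity(reference, candidate) >= threshold


# Hypothetical 10-residue placeholders, NOT the sequence of SEQ ID NO:1.
REFERENCE = "ACDEFGHIKL"
ONE_CHANGE = "ACDEFGHIKV"   # differs from REFERENCE at one position
TWO_CHANGES = "ACDEFGHIVV"  # differs from REFERENCE at two positions
```

In practice an alignment tool would handle gaps and unequal lengths; the sketch shows only the final counting step and the comparison against the 90% threshold recited in the claim.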

1. The present claims specifically define the claimed genus through the recitation
of chemical structure 

Court cases in which "DNA claims" have been at issue (which are hence relevant to claims to 
proteins encoded by the DNA and antibodies which specifically bind to the proteins) commonly 
emphasize that the recitation of structural features or chemical or physical properties is an important
factor to consider in a written description analysis of such claims. For example, in Fiers v. Revel, 25
USPQ2d 1601, 1606 (Fed. Cir. 1993), the court stated that: 

If a conception of a DNA requires a precise definition, such as by structure, formula, 
chemical name or physical properties, as we have held, then a description also requires 
that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts have 
noted that the claims attempted to define the claimed DNA in terms of functional characteristics without 
any reference to structural features. As set forth by the court in University of California v. Eli Lilly 
and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997): 

In claims to genetic material, however, a generic statement such as "vertebrate insulin 
cDNA" or "mammalian insulin cDNA," without more, is not an adequate written 
description of the genus because it does not distinguish the claimed genus from others, 
except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of 
structural features, has been a common basis by which courts have found invalid claims to DNA. For 
example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written description 
requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within its nucleotide 
sequence a subsequence having the structure of the reverse transcript of an mRNA of a 
vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following count:

A DNA which consists essentially of a DNA which codes for a human fibroblast
interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an adequate 
written description of the DNA of the count because that application mentioned a potential method for 
isolating the DNA. The Revel priority application, however, did not have a description of any particular 
DNA structure corresponding to the DNA of the count. The court therefore found that the Revel 
priority application lacked an adequate written description of the subject matter of the count. 

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional characteristics 
and were found not to comply with the written description requirement of 35 U.S.C. §112; i.e., "an 
mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA which codes for a human 
fibroblast interferon-beta polypeptide" in Fiers. In contrast to the situation in Lilly and Fiers, the 
claims at issue in the present application define antibodies which specifically bind to polypeptides in 
terms of chemical structure, rather than merely functional characteristics. For example, the language of 
independent claim 10, as amended, recites chemical structure to define the claimed genus: 

10. An isolated antibody which specifically binds to a polypeptide of claim 1 
selected from the group consisting of: 

a) a polypeptide comprising the amino acid sequence of SEQ ID NO: 1, 

b) a polypeptide comprising a naturally occurring amino acid sequence at least 
90% identical to the amino acid sequence of SEQ ID NO: 1, said polypeptide 
binds to microtubules, and 

c) a fragment of a polypeptide having the amino acid sequence of SEQ ID
NO:1, said fragment binds to microtubules.

From the above it should be apparent that the claims of the subject application are 
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the present 
claims is defined in terms of the chemical structure of SEQ ID NO: 1. In the present case, there is no 
reliance merely on a description of functional characteristics of the antibodies which specifically bind to 
the polypeptides recited by the claims. The claims of the present application recite structural
features of the polypeptides to which the antibodies specifically bind, and cases such as Lilly and
Fiers stress that the recitation of structure is an important factor to consider in a written description
analysis of claims of this type. By failing to base its written description inquiry "on whatever is now 
claimed," the Office Action failed to provide an appropriate analysis of the present claims and how they 
differ from those found not to satisfy the written description requirement in Lilly and Fiers. 

2. The present claims do not define a genus which is "highly variant" 

Furthermore, the claims at issue do not describe a genus which could be characterized as 
"highly variant." Available evidence illustrates that the claimed genus is of narrow scope. 

In support of this assertion, the Examiner's attention is directed to the enclosed reference by 
Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified distant 
evolutionary relationships," Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078). Through exhaustive 
analysis of a data set of proteins with known structural and functional relationships and with <90% 
overall sequence identity, Brenner et al. have determined that 30% identity is a reliable threshold for 
establishing evolutionary homology between two sequences aligned over at least 150 residues. 
(Brenner et al., pages 6073 and 6076.) Furthermore, local identity is particularly important in this case 
for assessing the significance of the alignments, as Brenner et al. further report that ≥40% identity over
at least 70 residues is reliable in signifying homology between proteins (Brenner et al., page 6076). 

The present application is directed, inter alia, to antibodies that bind to naturally occurring
hLC3 proteins related to the amino acid sequence of SEQ ID NO:1. In accordance with Brenner et
al., naturally occurring molecules may exist which could be characterized as hLC3 proteins and which
have as little as 40% identity over at least 70 residues to SEQ ID NO:1. The "variant language" of the
present claims recites, for example, antibodies which specifically bind to polypeptides comprising "a
naturally occurring amino acid sequence having at least 90% sequence identity to the sequence of SEQ
ID NO:1" (note that SEQ ID NO:1 has 121 amino acid residues). This variation is far less than that of
all potential hLC3 proteins related to SEQ ID NO:1, i.e., those hLC3 proteins having as little as 40%
identity over at least 70 residues to SEQ ID NO:1.
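
The arithmetic behind this comparison can be sketched as follows. This is an illustrative calculation only, assuming identity is counted over an ungapped alignment of the stated length; the helper name is hypothetical. The 121-residue length and the 90% figure come from the paragraph above, and the 40% over 70 residues figure is the Brenner et al. threshold cited above.

```python
def min_identical_residues(aligned_length: int, min_identity_pct: int) -> int:
    """Smallest number of identical positions that meets the identity threshold.

    Uses integer ceiling division so no floating-point rounding is involved.
    """
    return -(-aligned_length * min_identity_pct // 100)


# Claimed variants: at least 90% identical over the 121 residues of SEQ ID NO:1.
claimed_min_identical = min_identical_residues(121, 90)
claimed_max_differences = 121 - claimed_min_identical

# Brenner et al. threshold cited above: 40% identity over 70 residues.
homolog_min_identical = min_identical_residues(70, 40)
homolog_max_differences = 70 - homolog_min_identical

print(claimed_min_identical, claimed_max_differences)   # 109 12
print(homolog_min_identical, homolog_max_differences)   # 28 42
```

On these assumptions, a claimed variant may differ from SEQ ID NO:1 at no more than 12 of 121 positions, whereas a homolog under the cited threshold may differ at up to 42 positions of a 70-residue alignment.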



09/904,603
Docket No.: PF-0211-2 DIV

3. The state of the art at the time of the present invention is further advanced than 
at the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to comply 
with the written description requirement of 35 U.S.C. §112. The '525 patent claimed the benefit of 
priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and Application Serial 
No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the benefit of priority of an 
Israeli application filed on November 21, 1979. Thus, the written description inquiry in those cases was
based on the state of the art at essentially the "dark ages" of recombinant DNA technology.

The present application has a priority date of February 24, 1997. Much has happened in the
development of recombinant DNA technology in the 20 years between the filing of the applications
involved in Lilly and Fiers and the filing of the present application. For example, the technique of polymerase
chain reaction (PCR) was invented. Highly efficient cloning and DNA sequencing technology has been 
developed. Large databases of protein and nucleotide sequences have been compiled. Much of the 
raw material of the human and other genomes has been sequenced. With these remarkable advances 
one of skill in the art would recognize that, given the sequence information of SEQ ID NO:1, and the
additional extensive detail provided by the subject application, the present inventors were in possession
of the polypeptide variants to which the claimed antibodies bind at the time of filing of this application.

B. No description of the function of the polypeptides is required to satisfy the written 
description requirement for the claimed antibodies 

In support of his assertion that, regarding part b) of claim 10, the claimed genus of antibodies 
has not been fully described in the specification, the Examiner has asserted that the specification does 
not contain any disclosure of the function of all of the naturally occurring polypeptide sequences that are 
at least 90% identical to SEQ ID NO: 1 (Office Action mailed August 25, 2003, page 4). Applicants 
respectfully remind the Examiner that disclosure of functional characteristics is merely one of the factors 
which can be used as evidence that Applicants were in possession of the claimed invention at the time 
of filing. For at least the reasons set forth above in sections A(1) - A(3), Applicants have provided an
adequate written description of the claimed antibodies which specifically bind to a polypeptide
comprising a naturally occurring amino acid sequence at least 90% identical to the amino acid sequence
of SEQ ID NO:1. Accordingly, this rejection should be withdrawn.

C. Summary 

The Office Action failed to base its written description inquiry "on whatever is now claimed." 
Consequently, the Office Action did not provide an appropriate analysis of the present claims and how 
they differ from those found not to satisfy the written description requirement in cases such as Lilly and 
Fiers. In particular, the claims of the subject application are fundamentally different from those found 
invalid in Lilly and Fiers. The subject matter of the present claims is defined in terms of the chemical 
structure of SEQ ID NO:1. The courts have stressed that structural features are important factors to
consider in a written description analysis of claims to nucleic acids and proteins. In addition, the genus 
of antibodies which specifically bind to the polypeptides defined by the present claims is adequately 
described, as evidenced by Brenner et al. and consideration of the claims of the '740 patent involved in 
Lilly. Furthermore, there have been remarkable advances in the state of the art since the Lilly and 
Fiers cases, and these advances were given no consideration whatsoever in the position set forth by the 
Office Action. 

VI. Claim Rejections - 35 U.S.C. § 102 

Claims 10, 46, 48, 51-52, and 54-57 stand rejected under 35 U.S.C. § 102(b) as allegedly
being anticipated by Mann et al. (Journal of Biological Chemistry Vol. 269, No. 15, pp. 11492-11497).

In particular, the Office Action asserts that Mann et al. set forth an amino acid sequence that
"...contains 34 consecutive amino acids in common with SEQ ID NO:1, giving an 83.4% overall
match" (Office Action mailed August 25, 2003, page 5). The Examiner states that the reference 
"...appears to disclose the same antibody claimed by applicants" (Office Action mailed August 25, 
2003, page 6). Applicants respectfully disagree and traverse the rejection. 

The scope of the claim as amended requires that the isolated antibody bind specifically to a
polypeptide comprising the amino acid sequence of SEQ ID NO:1; a polypeptide comprising a
naturally occurring amino acid sequence at least 90% identical to the amino acid sequence of SEQ ID
NO:1, wherein said polypeptide binds to microtubules; or a fragment of a polypeptide having the amino
acid sequence of SEQ ID NO:1, wherein said fragment binds to microtubules. If an antibody also binds to
rat Light Chain 3 (LC3), then it falls outside the scope of the claims, since claim 10 requires it to bind
specifically only to the recited polypeptides. Therefore, it appears that the Examiner's position is based
upon a misinterpretation of the scope of claim 10.

The M.P.E.P. is clear in its instruction regarding determination of the scope of a claim: 

The breadth of the claims in the application should always be carefully noted; that is, the 
examiner should be fully aware of what the claims do not call for, as well as what they 
do require. During patent examination, the claims are given the broadest reasonable 
interpretation consistent with the specification. See In re Morris, 127 F.3d 1048, 
44 USPQ2d 1023 (Fed. Cir. 1997). (M.P.E.P. § 904.01, emphasis in original).

Substantially every claim includes within its breadth or scope one or more
variant embodiments that are not disclosed in the application, but which would 
anticipate the claimed invention if found in a reference. The claim must be so 
analyzed and any such variant encountered during the search should be recognized. 
(M.P.E.P. § 904.01(a); emphasis added). 

Moreover, the 34 amino acid fragment of rat LC3 that is in common with SEQ ID NO:1 does
not fall within the scope of the recited polypeptides to which the claimed isolated antibody binds
specifically. The Examiner has not shown that the 34 amino acid fragment of rat LC3 would be
sufficient to bind microtubules. Accordingly, the 34 amino acid fragment of rat LC3 does not fall within
the scope of the recited polypeptides of claim 10 to which the claimed isolated antibody binds specifically.

Applicants respectfully call the Examiner's attention to M.P.E.P. § 706.02, which states that 
"for anticipation under 35 U.S.C. 102, the reference must teach every aspect of the claimed 
invention either explicitly or impliedly. Any feature not directly taught must be inherently present" 
(emphasis added). Furthermore, M.P.E.P. § 2112 states that "[t]he fact that a certain result or
characteristic may occur or be present in the prior art is not sufficient to establish the inherency of that 
result or characteristic" (emphasis in original). M.P.E.P. § 2112 further states that "[t]he examiner 
must provide a basis in fact and/or technical reasoning to reasonably support the determination that the 
allegedly inherent characteristic necessarily flows from the teachings of the applied prior art"
(emphasis in original). Yet the current Office Action provides no such basis or reasoning. Accordingly,
for at least all the above reasons, Applicants respectfully request reconsideration and withdrawal of the
rejection.

VII. Claim Rejections - 35 U.S.C. § 103 

Claims 10, 46, 48, 51-52, and 54-57 have been rejected under 35 U.S.C. § 103(a) as 
allegedly being unpatentable over Mann et al. in view of Queen et al. 

The Examiner alleges that it would have been prima facie obvious to one of ordinary skill in the 
art at the time of the invention to have humanized the antibody disclosed by Mann et al. by the method 
taught by Queen et al. (Office Action mailed August 25, 2003, page 8). 

As Applicants have submitted above, Mann et al. fails to teach or suggest an isolated antibody
that binds specifically to a polypeptide comprising the amino acid sequence of SEQ ID NO:1; a
polypeptide comprising a naturally occurring amino acid sequence at least 90% identical to the amino
acid sequence of SEQ ID NO:1, wherein said polypeptide binds to microtubules; or a fragment of a
polypeptide having the amino acid sequence of SEQ ID NO:1, wherein said fragment binds to microtubules.

Furthermore, Queen et al. fails to remedy the deficiency of Mann et al. Accordingly, the
combined cited prior art does not teach or suggest an antibody that binds specifically to only the
polypeptides recited in claim 10. For at least the above reasons, Applicants respectfully request
reconsideration and withdrawal of the rejection.

CONCLUSION

In light of the above amendments and remarks, Applicants submit that the present application is 
fully in condition for allowance, and request that the Examiner withdraw the outstanding 
objections/rejections. Early notice to that effect is earnestly solicited. 

If the Examiner contemplates other action, or if a telephone conference would expedite 
allowance of the claims, Applicants invite the Examiner to contact the undersigned at the number 
listed below. 

Applicants believe that no fee is due with this communication. However, if the USPTO 
determines that a fee is due, the Commissioner is hereby authorized to charge Deposit Account 
No. 09-0108. 



Date: 



Respectfully submitted, 

CORPORATION 




Vema, Ph.D.
Reg. No. 33,287
Direct Dial Telephone: (650) 845-5415 



Customer No.: 27904 
3160 Porter Drive 
Palo Alto, California 94304 
Phone: (650) 855-0555 
Fax: (650) 849-8886 



Attachment(s): 

1. Brenner et al. (Proc. Natl. Acad. Sci. USA (1998) 95:6073-6078).

ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs BLAST [Altschul, S. F., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For
more distantly related proteins, they do much less well; only
one-half of the relationships between proteins with 20-30%
identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by
any pairwise comparison method; however, those which are
identified may be used with confidence.

Sequence database searching plays a role in virtually every
branch of molecular biology and is crucial for interpreting the
sequences issuing forth from genome projects. Given the
method's central role, it is surprising that overall and relative
capabilities of different procedures are largely unknown. It is
difficult to verify algorithms on sample data because this
requires large data sets of proteins whose evolutionary
relationships are known unambiguously and independently of the
methods being evaluated. However, nearly all known homologs
have been identified by sequence analysis (the method
to be tested). Also, it is generally very difficult to know, in the
absence of structural data, whether two proteins that lack clear
sequence similarity are unrelated. This has meant that although
previous evaluations have helped improve sequence
comparison, they have suffered from insufficient, imperfectly
characterized, or artificial test data. Assessment also has been
problematic because high quality database sequence searching
attempts to have both sensitivity (detection of homologs) and
specificity (rejection of unrelated proteins); however, these
complementary goals are linked such that increasing one
causes the other to be reduced.

The publication costs of this article were defrayed in part by page charge
payment. This article must therefore be hereby marked "advertisement" in
accordance with 18 U.S.C. §1734 solely to indicate this fact.

© 1998 by The National Academy of Sciences 0027-8424/98/956073-6$2.00/0
PNAS is available online at http://www.pnas.org.



Sequence comparison methodologies have evolved rapidly,
so no previously published tests has evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge.
For example, there has been no published assessment of
thresholds for scoring schemes more sophisticated than
percentage identity. Thus, the widely discussed statistical scoring
measures have never actually been evaluated on large
databases of real proteins. Moreover, the different scoring schemes
commonly in use have not been compared.

Beyond these issues, there is a more fundamental question:
in an absolute sense, how well does pairwise sequence
comparison work? That is, what fraction of homologous proteins
can be detected using modern database searching methods?
In this work, we attempt to answer these questions and to
overcome both of the fundamental difficulties that have
hindered assessment of sequence comparison methodologies.
First, we use the set of distant evolutionary relationships in
the SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
The SCOP database provides a uniquely reliable set of
homologs, which are known independently of sequence
comparison. Second, we use an assessment method that jointly
measures both sensitivity and specificity. This method allows
straightforward comparison of different sequence searching
procedures. Further, it can be used to aid interpretation of real
database searches and thus provide optimal and reliable
results.

Previous Assessments of Sequence Comparison. Several
previous studies have examined the relative performance of
different sequence comparison methods. The most
encompassing analyses have been by Pearson (6, 7), who compared
the three most commonly used programs. Of these, the
Smith-Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs.

To test the methods, Pearson selected two representative
proteins from each of 67 protein superfamilies defined by the
PIR database (9). Each was used as a query to search the
database, and the matched proteins were marked as being
homologous or unrelated according to their membership of PIR
superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Abbreviation: EPQ, errors per query.

Large scale analyses of matrices have also been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a predetermined
score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the
BLOSUM62 matrix (14) performed markedly better than the
extrapolated PAM-series matrices (15), which previously had
been popular.

A crucial aspect of any assessment is the data that are used
to test the ability of the program to find homologs. But in
Pearson's and the Henikoffs' evaluations of sequence
comparison, the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependency of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homologs missed by older programs. For instance,
immunoglobulin variable and constant domains are clearly
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schneider
(17) used protein structures to evaluate sequence comparison.
Rather than comparing different sequence comparison
algorithms, their work focused on determining a length-dependent
threshold of percentage identity, above which all
proteins would be of similar structure. A result of this analysis
was the hssp equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter
alignments require higher identity. (Other studies also have
used structures (18-20), but these focused on a small number
of model proteins and were principally oriented toward
evaluating alignment accuracy rather than homology detection.)

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was

Karlin and Altschul statistics (22, 23) and empirical
approaches have been recently added to fasta and ssearch. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical
tractability of statistical scores "is a crucial feature of the blast
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homologs. Thus, although many researchers have suggested
that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on biological
data to determine the degree to which such rankings are
superior.

A Database for Testing Homology Detection. Since the
discovery that the structures of hemoglobin and myoglobin are
very similar though their sequences are not (29), it has been
apparent that comparing structures is a more powerful (if less
convenient) way to recognize distant evolutionary relationships
than comparing sequences. If two proteins show a high
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship,
though their sequence similarity may be low.

Proc. Natl. Acad. Sci. USA 95 (1998)

The recent growth of protein structure information combined
with the comprehensive evolutionary classification in
the scop database (4, 5) have allowed us to overcome previous
limitations. With these data, we can evaluate the performance
of sequence comparison methods on real protein sequences
whose relationships are known confidently. The scop database
uses structural information to recognize distant homologs, the
large majority of which can be determined unambiguously.

lins would be recognized as related by the vast majority of the
biological community despite the lack of high sequence similarity.

From scop, we extracted the sequences of domains of
proteins in the Protein Data Bank (pdb) (30) and created two
databases. One (pdb90d-b) has domains which were all <90%
identical to any other, whereas the other (pdb40d-b) had those <40%
identical. The databases were created by first sorting all
protein domains in scop by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The pdb40d-b database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In pdb90d-b, the 2,079 domains have 53,988 relationships,
representing 1.2% of all pairs. Low complexity regions

can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of scop
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
Analyses from both databases were generally consistent, but
pdb40d-b focuses on distantly related proteins and reduces the

families (31, 32), whereas pdb90d-b (with more sequences)
improves evaluations of statistics. Except where noted otherwise,
the distant homolog results here are from pdb40d-b.
Although the precise numbers reported here are specific to the
databases used, we expect the trends to be general.
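
The greedy culling procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `quality` and `identity` callables, the toy sequences, and their quality values are hypothetical stand-ins, not the scop quality ranking or the alignment-based identity actually used to build the databases.

```python
from typing import Callable, List, TypeVar

D = TypeVar("D")


def select_representatives(
    domains: List[D],
    quality: Callable[[D], float],
    identity: Callable[[D, D], float],
    threshold: float,
) -> List[D]:
    """Greedy selection: keep the best remaining domain, discard its near-duplicates."""
    # Sort once so the highest quality domain is always at the front of the list.
    remaining = sorted(domains, key=quality, reverse=True)
    selected: List[D] = []
    while remaining:
        best = remaining[0]
        selected.append(best)
        # Drop the selected domain and every domain at or above the identity threshold to it.
        remaining = [d for d in remaining[1:] if identity(best, d) < threshold]
    return selected


def ordered_pairs(n: int) -> int:
    """Number of ordered (query, target) pairs among n distinct domains."""
    return n * (n - 1)


def toy_identity(a: str, b: str) -> float:
    """Hypothetical identity measure: percent of matching positions in equal-length strings."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical toy data: sequence -> quality score (higher is better).
toy_quality = {
    "AAAAAAAAAC": 0.9,
    "AAAAAAAAAA": 0.8,
    "CCCCCGGGGG": 0.7,
    "CCCCCCCCCC": 0.6,
}
toy_domains = list(toy_quality)

picked_40 = select_representatives(toy_domains, lambda d: toy_quality[d], toy_identity, 40.0)
picked_90 = select_representatives(toy_domains, lambda d: toy_quality[d], toy_identity, 90.0)
total_pairs_40 = ordered_pairs(1323)  # 1,749,006, matching the total quoted in the text
```

With the toy data, the 40% threshold keeps two sequences and the 90% threshold keeps three. The pair count confirms that 1,323 domains give the 1,749,006 ordered pairs quoted above, of which 9,044 is about 0.5%.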

Assessment Data and Procedure. Our assessment of sequence
comparison may be divided into four different major
categories of tests. First, using a single sequence comparison
algorithm at a time, we evaluated the effectiveness of
different scoring schemes. Second, we assessed the reliability
of scoring procedures, including an evaluation of the validity
of statistical scoring. Third, we compared sequence comparison
algorithms (using the optimal scoring scheme) to determine
their relative performance. Fourth, we examined the
distribution of homologs and considered the power of pairwise
sequence comparison to recognize them. All of the analyses
used the databases of structurally identified homologs and a
new assessment criterion.

The analyses tested blast (1), version 1.4.9MP, and wu-blast2
(2), version 2.0a13MP. Also assessed was the fasta
package, version 3.0.76 (3), which provided fasta and the
ssearch implementation of Smith-Waterman (8). For
ssearch and fasta, we used BLOSUM45 with gap penalties

Default parameters and matrix (BLOSUM62)
were used for blast and wu-blast2.

The "Coverage vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence
from the database was used as a query to search the database.
This yielded ordered pairs of query and target sequences with
associated scores, which were sorted, on the basis of their
scores, from best to worst. The ideal method would have
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related pairs of sequences consistent with an accept

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy.

This assessment procedure is directly relevant to practical
sequence database searching, for it provides precisely the
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Fig. 4. Reliability of statistical scores in pdb90d-b. Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for ssearch and
fasta, whereas P-values are shown for blast and wu-blast2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers and diverge
at higher values, as indicated by the lower bold line.) E-values from
ssearch and fasta are shown to have good agreement with EPQ but
underestimate the significance slightly. blast and wu-blast2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for pdb40d-b were similar to those for pdb90d-b
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by database
searching programs, if the programs' estimates are
accurate.
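
The coverage and EPQ definitions above translate directly into a short computation. The following is a minimal sketch rather than the authors' code: the function names, the `(score, is_homolog)` hit representation, and the toy numbers are assumptions made for illustration, and tied scores are not treated specially.

```python
from typing import Iterable, List, Tuple

Hit = Tuple[float, bool]  # (score, is_homolog); higher score means a better match


def coverage_vs_error(
    hits: Iterable[Hit], n_homolog_pairs: int, n_queries: int
) -> List[Tuple[float, float, float]]:
    """Return (threshold score, coverage, errors per query) for every cutoff."""
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    curve = []
    true_pairs = 0
    false_pairs = 0
    for score, is_homolog in ranked:
        if is_homolog:
            true_pairs += 1
        else:
            false_pairs += 1
        coverage = true_pairs / n_homolog_pairs  # fraction of known homologs found
        epq = false_pairs / n_queries            # nonhomologous pairs per query
        curve.append((score, coverage, epq))
    return curve


def coverage_at_epq(curve: List[Tuple[float, float, float]], max_epq: float) -> float:
    """Best coverage reachable without exceeding the given errors per query."""
    return max((cov for _, cov, epq in curve if epq <= max_epq), default=0.0)


# Hypothetical toy search results: 4 true homolog pairs exist, 2 queries were run.
toy_hits = [(50.0, True), (40.0, True), (30.0, False), (20.0, True), (10.0, False)]
toy_curve = coverage_vs_error(toy_hits, n_homolog_pairs=4, n_queries=2)
```

For scores where smaller is better, such as E-values, the caller can pass the negated score. Plotting coverage against EPQ for each protocol gives the coverage vs. error plot; `coverage_at_epq(curve, 0.01)` corresponds to the "coverage at 1% EPQ" figures quoted in the text.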

The Performance of Scoring Schemes. All of the programs
tested could provide three fundamental types of scores. The
first score is the percentage identity, which may be computed
in several ways based on either the length of the alignment or
the lengths of the sequences. The second is a "raw" or
"Smith-Waterman" score, which is the measure optimized by
the Smith-Waterman algorithm and is computed by summing
the substitution matrix scores for each position in the alignment
and subtracting gap penalties. In blast, a measure
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related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.
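
To make the first two score types concrete, the sketch below computes percentage identity with two different denominators and a raw score for one gapped alignment. The match/mismatch values and gap penalties are hypothetical placeholders, not BLOSUM entries or the parameters used in this study, and a gap of length k is assumed to cost `gap_open + k * gap_extend` (conventions differ between programs).

```python
from typing import Callable


def percent_identity(a: str, b: str, denominator: str = "alignment") -> float:
    """Percent identity of two aligned strings ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    if denominator == "alignment":
        length = len(a)  # all alignment columns, including gapped ones
    elif denominator == "shorter":
        length = min(len(a.replace("-", "")), len(b.replace("-", "")))
    else:
        raise ValueError("unknown denominator: " + denominator)
    return 100.0 * matches / length


def raw_score(
    a: str, b: str, subst: Callable[[str, str], int], gap_open: int, gap_extend: int
) -> int:
    """Sum of substitution scores over aligned residues minus affine gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # Charge the opening penalty once per run of gap columns
            # (adjacent gaps in either sequence count as one run, a simplification).
            score -= gap_extend + (0 if in_gap else gap_open)
            in_gap = True
        else:
            score += subst(x, y)
            in_gap = False
    return score


def toy_subst(x: str, y: str) -> int:
    """Hypothetical substitution scores: +5 for identity, -2 otherwise."""
    return 5 if x == y else -2


aligned_a = "ACDEFG"
aligned_b = "ACD-FA"
pid_alignment = percent_identity(aligned_a, aligned_b, "alignment")  # 4 of 6 columns
pid_shorter = percent_identity(aligned_a, aligned_b, "shorter")      # 4 of 5 residues
toy_raw = raw_score(aligned_a, aligned_b, toy_subst, gap_open=10, gap_extend=1)
```

The same alignment gives about 67% or 80% identity depending on the denominator, which is one reason a bare percentage is hard to compare across reports. The raw score here is 4 matches (+20), one mismatch (-2), and one single-residue gap (-11), giving 7.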

Sequence Identity. Though it has been long established that
percentage identity is a poor measure (35), there is a common
rule-of-thumb stating that 30% identity signifies homology.
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present
results. As databases have grown, so have the possibilities for
chance alignments with high identity; thus, the reported cutoffs
lead to frequent errors. Fig. 2 shows one of the many pairs of
proteins with very different structures that nonetheless have
high levels of identity over considerable aligned regions.
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The
principal reasons percentage identity does so poorly seem to be
that it ignores information about gaps and about the
conservative or radical nature of residue substitutions.

From the pdb90d-b analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.5% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues
in length before 40% is a reasonable threshold, for a database
of this particular size and composition.

At a given reliability, scores based on percentage identity
detect just a fraction of the distant homologs found by
statistical scoring. If one measures the percentage identity in
the aligned regions without consideration of alignment length,
then a negligible number of distant homologs are detected.
Use of the hssp equation improves the value of percentage
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards
most of the information measured in a sequence comparison.
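
A length-dependent identity threshold of the hssp kind can be sketched as follows. The text above only says that the curve reaches about 25% at 80 residues and rises for shorter alignments. The power-law form and the constants 290.15 and -0.562 are an assumption here: they are the commonly quoted form of the curve from ref. 17, recalled rather than taken from this document, and should be checked against that reference before use.

```python
def hssp_threshold(length: int) -> float:
    """Percent identity required at a given alignment length (assumed power-law form)."""
    if length < 10:
        raise ValueError("the assumed curve is not defined below 10 residues")
    # Beyond 80 residues the threshold is held flat at its 80-residue value (about 25%).
    effective = min(length, 80)
    return 290.15 * effective ** -0.562


def above_hssp_threshold(identity_percent: float, length: int) -> bool:
    """True if an alignment would be called structurally similar by the assumed curve."""
    return identity_percent >= hssp_threshold(length)


threshold_80 = hssp_threshold(80)  # close to 25%
threshold_40 = hssp_threshold(40)  # noticeably higher for a shorter alignment
unrelated_pair_passes = above_hssp_threshold(43.5, 62)
```

Under these assumed constants, the unrelated pair mentioned above (43.5% identity over 62 residues) lies well above the curve, which illustrates how a threshold based on identity and length alone can still admit errors.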

Raw Scores. Smith-Waterman raw scores perform better
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise
when using either raw or bit scores because a 20% change in
cutoff score could yield a tenfold difference in EPQ. However,
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins
matched and the size of the database. Raw score thresholds
also are affected by matrix and gap parameters.

Statistical Scores. Statistical scores were introduced partly
to overcome the problems that arise from raw scores. This
scoring scheme provides the best discrimination between
homologous proteins and those which are unrelated. Most
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likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is
scaled appropriately.

We find that statistical scores are not only powerful but also
easy to interpret. ssearch and fasta show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two
sequences being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.
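
The reading of E-values as expected errors per query, and their relation to P-values, can be illustrated with a few lines of arithmetic. The conversion P = 1 - exp(-E) is not stated in the text; it is the standard relation under the assumption that the number of chance matches per query is Poisson distributed, and the function names are illustrative only.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance match, assuming Poisson-distributed chance hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def expected_errors(e_value: float, n_queries: int) -> float:
    """Expected number of nonhomologous matches at this E-value over n_queries searches."""
    return e_value * n_queries


p_small = p_value_from_e_value(0.01)    # nearly equal to the E-value itself
p_large = p_value_from_e_value(3.0)     # saturates below 1 while E keeps growing
errors_in_100 = expected_errors(0.01, 100)
```

For small values the two measures nearly coincide (E = 0.01 gives P of about 0.00995), while at E = 3 the P-value is about 0.95. An E-value cutoff of 0.01 applied to 100 queries is expected to admit about one nonhomologous pair, which is the quantity the EPQ measure estimates empirically.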

The P-values from blast also should be directly interpretable
but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database. Nonetheless,
these results strongly suggest that the analytic theory is
fundamentally appropriate. wu-blast2 scores were more reliable
than those from blast, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of Algorithms.
The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in pdb40d-b.
ssearch with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. blast, which
identifies 15%, was the worst performer, whereas fasta
ktup = 1 is nearly as effective as ssearch. fasta ktup = 2 and
wu-blast2 are intermediate in their ability to detect homology.
Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. ssearch is 25 times slower than blast and 6.5 times
slower than fasta ktup = 1. wu-blast2 is slightly faster than
fasta ktup = 2, but the latter has more interpretable scores.

In pdb90d-b, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is wu-blast2. Consequently, we infer that the
differences between fasta ktup = 1, ssearch, and wu-blast2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such relationships
have no more sequence identity than would be expected
by chance. ssearch with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are ju pairs of homologous proteins that do not have significant
E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by ssearch E-values. However, although the number
of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
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(ssearch with E^ilues) au% PK3 tT ™ «««h>n| method 
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pairwise sequence comparison to detect them.
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These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power

wu * um - and *» h ^ EEL "S 

and that its overall detection of homologs
was substantially better than that of ungapped blast but not
quite equal to that of wu-blast2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 24, 25, 27
and references therein) suggests that the most effective se-

in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view.

Our results also suggest two further points. First, the E-values
reported by fasta and ssearch give fairly accurate
estimates of the significance of each match, but the P-values
provided by blast and wu-blast2 underestimate the true
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extent of errors. Second, ssearch, wu-blast2, and fasta
ktup = 1 perform best, though blast and fasta ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence comparison
can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather, it
indicates that any relatives it might have are distant ones.**

**Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.
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